Aggregating Frame-Level Information in the Spectral Domain With Self-Attention for Speaker Embedding

Abstract

Most pooling methods in state-of-the-art speaker embedding networks are implemented in the temporal domain. However, due to the high non-stationarity in the feature maps produced from the last frame-level layer, it is not advantageous to use the global statistics (e.g., means and standard deviations) of the temporal feature maps as aggregated embeddings. This motivates us to explore stationary spectral representations and perform aggregation in the spectral domain. In this paper, we propose attentive short-time spectral pooling (attentive STSP) from a Fourier perspective to exploit the local stationarity of the feature maps. In attentive STSP, for each utterance, we compute the spectral representations through a weighted average of the windowed segments within each spectrogram by attention weights and aggregate their lowest spectral components to form the speaker embedding. Because most of the feature map energy is concentrated in the low-frequency region of the spectral domain, attentive STSP facilitates the information aggregation by retaining the low spectral components only. Attentive STSP is shown to consistently outperform attentive pooling on VoxCeleb1, VOiCES19-eval, SRE16-eval, and SRE18-CMN2-eval. This observation suggests that applying segment-level attention and leveraging low spectral components can produce discriminative speaker embeddings.
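The pooling idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the Hamming window, the window length, the hop size, the use of magnitude spectra, and the number of retained components (`n_low`) are all assumptions made for illustration. In particular, the attention scores here come from a fixed placeholder vector (`score_vec`), whereas a real speaker embedding network would learn its attention weights during training.

```python
import numpy as np


def frame_segments(x, win_len, hop):
    """Slice a (channels, frames) feature map into overlapping segments."""
    n_seg = 1 + (x.shape[1] - win_len) // hop
    idx = np.arange(win_len)[None, :] + hop * np.arange(n_seg)[:, None]
    return x[:, idx]  # shape: (channels, n_seg, win_len)


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()


def attentive_stsp_sketch(x, score_vec, win_len=8, hop=4, n_low=2):
    """Illustrative segment-level attentive pooling in the spectral domain.

    x         : (channels, frames) frame-level feature map
    score_vec : (channels,) placeholder attention parameters (hypothetical)
    Returns the flattened embedding and the per-segment attention weights.
    """
    # Windowed segments along the time axis of every channel.
    segs = frame_segments(x, win_len, hop) * np.hamming(win_len)
    # Short-time magnitude spectrum of each segment (assumed representation).
    spec = np.abs(np.fft.rfft(segs, axis=-1))  # (channels, n_seg, win_len//2 + 1)
    # One attention score per segment from a hypothetical linear scorer.
    scores = score_vec @ segs.mean(axis=-1)  # (n_seg,)
    alpha = softmax(scores)
    # Attention-weighted average of the segment spectra.
    pooled = np.einsum("s,csk->ck", alpha, spec)  # (channels, n_bins)
    # Keep only the lowest spectral components of each channel.
    return pooled[:, :n_low].reshape(-1), alpha
```

For a feature map with 4 channels and 40 frames, the default settings produce 9 segments and an embedding of length 4 × 2 = 8. Retaining only the first `n_low` bins mirrors the abstract's observation that most feature-map energy lies in the low-frequency region.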

Similar resources

The Search for the Self in Beckett's Theatre: Waiting for Godot and Endgame

This thesis is based upon the works of Samuel Beckett, one of the greatest writers of contemporary literature. Here, I have tried to focus on one of the main themes in Beckett's works: the search for the real "me" or the real self, which is not only a problem to be solved for the Beckett man but also for each of us. I have tried to show Beckett's techniques in approaching this unattainable goal, base...

Using Exciting and Spectral Envelope Information and Matrix Quantization for Improvement of the Speaker Verification Systems

Speaker verification from a few spoken words or sentences has many applications. Many methods, such as DTW, HMM, VQ, and MQ, can be used for speaker verification. We applied MQ for its precise, reliable, and robust performance with computational simplicity. We also used pitch frequency and log gain contour to further improve system performance.

Aggregating Frame-level Features for Large-Scale Video Classification

This paper introduces the system we developed for the Google Cloud & YouTube-8M Video Understanding Challenge, which can be considered as a multi-label classification problem defined on top of the large scale YouTube-8M Dataset [1]. We employ a large set of techniques to aggregate the provided frame-level feature representations and generate video-level predictions, including several variants o...

Journal

Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing

Year: 2022

ISSN: 2329-9304, 2329-9290

DOI: https://doi.org/10.1109/taslp.2022.3153267